Comparing Variability in Ultra Trail Running Results Across Countries

using coefficient of variation to compare variation in race results

By Brendan Graham in tidy tuesday

January 2, 2022

Introduction

This post looks at a past TidyTuesday data set about ultra trail running. After looking at the data I attempt to quantify and compare the variation in rankings between countries.

Races

Race data is at the race level; events are made up of races. There can be multiple event observations per year.

race %>% 
  add_table()


Example data for Run Rabbit Run event:

race %>% 
  filter(event == "RUN RABBIT RUN") %>%
  select(event, race, year) %>% 
  arrange(year) %>%
  add_table()



Some races can have thousands of participants:

race %>% 
  select(event, race, participants) %>%
  distinct() %>%
  arrange(desc(participants)) %>%
  head(10) %>%
  add_table()


Rankings

Ranking data is at the racer level; racers can appear more than once:

ultra_rankings %>% 
  add_table()
ultra_rankings %>%
  group_by(runner, nationality) %>% 
  tally(sort = T) %>%
  head(10) %>%
  add_table()


Variation in Ranking Among Countries

Rankings seem interesting, let’s try and see which countries have runners with the most consistent rankings. First we need to prep the data a little bit. We’ll set a threshold to only include countries where runners from that given country participated in at least 15 races.

country_counts <- 
  ultra_rankings %>%
  select(nationality, race_year_id) %>%
  distinct() %>%
  group_by(nationality) %>%
  tally()

runner_count <-
  ultra_rankings %>%
  select(nationality, runner) %>%
  distinct() %>%
  group_by(nationality) %>%
  tally()

quantile(country_counts$n, probs = seq(0, 1, .10))
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##   1.0   1.0   2.0   3.3   8.0  15.5  31.0  41.7  82.4 128.7 700.0
quantile(runner_count$n, probs = seq(0, 1, .10))
##      0%     10%     20%     30%     40%     50%     60%     70%     80%     90% 
##     1.0     1.0     1.0     3.0     6.0    16.0    49.4   143.1   261.6   873.0 
##    100% 
## 20345.0
top_countries <- 
  ultra_rankings %>%
  select(nationality, race_year_id) %>%
  distinct() %>%
  group_by(nationality) %>%
  tally() %>%
  filter(n > 15) %>% 
  pull(nationality)

top_countries <- 
  ultra_rankings %>%
  filter(nationality %in% top_countries) %>%
  na.omit() %>%
  select(nationality, runner) %>%
  distinct() %>%
  group_by(nationality) %>%
  tally() %>% 
  filter(n > 15) %>% 
  pull(nationality)

ultra_rankings %>%
  filter(nationality %in% top_countries) %>%
  filter(!is.na(rank)) %>%
  ggplot(aes(rank)) + 
  geom_histogram() + 
  ggtitle("race rankings") + 
  big_labels

Ranking is highly skewed, so we can use a modified formula for the coefficient of variation meant for log-normal data: \[ {cv_{raw}} = \sqrt{e^{s^2_{ln}} - 1} \]

where \(s_{ln}\) is the sample standard deviation of the data after a natural log transformation. Another alternative we could use the Coefficient of Quartile Variation (see below), but we’ll stick with the modified cv instead.

Coefficient of Quartile Variation

\[ QCV = [(q3 - q1)/(q3 + q1))]*100 \]

Viewing the Variation

After calculating the CV for log-normal data, we can compare each countries variation. France, Luxembourg and Korea are interesting in that they are relatively consistent, but their median ranks are very high. This suggests racers from these countries get consistently poor results

cv_log <- 
  ultra_rankings %>%
  filter(nationality %in% top_countries) %>%
  filter(!is.na(rank), !(is.na(gender))) %>%
  mutate(log_rank = log(rank)) %>% 
  group_by(nationality) %>% 
  summarise(
    mean_rank = mean(rank, na.rm = T),
    median_rank = median(rank, na.rm = T),
    q3 = quantile(x = rank, probs = .75, na.rm = T),
    q1 = quantile(x = rank, probs = .25, na.rm = T),
    sd_rank = sd(log_rank, na.rm = T),
    cv_log = sqrt((exp(1)^(sd_rank^2)) - 1),
    qcv = ((q3 - q1)/(q3 + q1))*100
  )

cv_log %>%
  ggplot(., aes(x = reorder(nationality, cv_log), y = cv_log, label = round(mean_rank, 2))) + 
  geom_point(size = 3.5, alpha = .85, aes(color = mean_rank)) + 
  scale_color_distiller(direction = -1, palette = "BuPu") +
  coord_flip() +
  labs(y = "Coefficient of Variation", x = "", color = "Mean Rank",
       title = "Coefficient of Variation in Race Rankings") + 
  big_labels

Let’s see if this CV aligns with the distributions of the top 2 and bottom 2 ranked countries. We can compare the ranking distributions of THA and USA with that of EDU and EST. The plot shows that ranking distribution for THA and USA are not as skewed as ECU and EST

ultra_rankings %>%
  filter(nationality %in% top_countries) %>%
  filter(!is.na(rank)) %>%
  filter(nationality %in% c("USA", "THA", "EST", "ECU")) %>% 
  ggplot(aes(x = rank, fill = nationality)) + 
  geom_histogram(position = "stack", show.legend = FALSE, alpha = .55) + 
  facet_wrap(vars(nationality), scales = 'free') +
  big_labels + 
  scale_fill_brewer(palette = "Dark2") 

Plotting each country’s mean rank vs CV shows there is slight correlation; countries whose runners are more consistent tend to have better ranking on average.

cv_log %>%
  ggplot(., aes(x = cv_log, y = mean_rank, label = nationality, 2)) + 
  geom_text_repel() + 
  geom_point() + 
  geom_smooth(method = "gam", se = F) +
  labs(x = "Coefficient of Variation", y = "Mean Rank")  + 
  scale_y_continuous(breaks = seq(0, 800, 25)) +
  scale_x_continuous(limits = c(1.5, 10), breaks = seq(0, 10, 1)) + 
  big_labels

Posted on:
January 2, 2022
Length:
1887 minute read, 401831 words
Categories:
tidy tuesday
Tags:
tidy tuesday
See Also:
Combining 'Traditional' and Text-Based Models to Board Game Ratings
Trying out new `Tidymodels` package: {vetiver}
Comparing Traditional and Text-Based Models to Predict Chocolate Rating